Abstract

A new methodology for discrimination is proposed. It is based on
kernel orthonormalized partial least squares (PLS) dimensionality
reduction of the original data space followed by support vector
machines for classification. The close connection of orthonormalized
PLS to Fisher’s approach to linear discrimination, or equivalently to
canonical correlation analysis, is described. This connection
motivates preferring orthonormalized PLS over principal component
analysis. Good behavior of the proposed method is demonstrated on
13 different benchmark data sets and on the real-world problem of
classifying finger movement periods versus non-movement periods
based on electroencephalograms.

1. Introduction 

The partial least squares (PLS) method (Wold, 1975; Wold et al., 1984)
has been a popular modeling, regression and discrimination technique
in its domain of origin, chemometrics. In its general form, PLS
creates score vectors (components, latent vectors) by using the
existing correlations between different sets of variables (blocks of
data) while also keeping most of the variance of both sets. PLS has
proved to be useful in situations where the number of observed
variables is significantly greater than the number of observations
and high multicollinearity among the variables exists. This situation
is also quite common in kernel-based learning, where the original
data are mapped to a high-dimensional feature space corresponding to
a reproducing kernel Hilbert space (RKHS). Motivated by recent
results in kernel-based learning and support vector machines (Vapnik,
1998; Schölkopf & Smola, 2002), a new form of discrimination is
proposed. It is based on the kernel orthonormalized PLS method for
dimensionality reduction combined with the support vector machines
classifier (SVC) (Vapnik, 1998; Schölkopf & Smola, 2002).

Consider ordinary least squares regression with the outputs Y being
an indicator vector that codes two classes with two different labels
representing class membership. The regression coefficient vector from
the least squares solution is then proportional to the linear
discriminant analysis (LDA) direction (Hastie et al., 2001). This
close connection between LDA and least squares regression partially
justified the use of PLS for discrimination. However, by showing the
close connection between Fisher’s LDA, canonical correlation analysis
(CCA) and orthonormalized PLS methods, Barker and Rayens (2003)
justified the use of PLS for discrimination more rigorously. This
connection also indicates that orthonormalized PLS, or its nonlinear
kernel variant, is preferable to linear or nonlinear kernel-based
principal components analysis (PCA) when dimensionality reduction is
carried out for discrimination.

In comparison to PLS regression on the dummy matrix Y, the use of SVC
on selected PLS score vectors (components) is motivated by the
possibility of constructing an optimal separating hyperplane, by
better control of the overlap between classes when the data are not
separable, by the use of the theoretically more principled “hinge”
loss function instead of the squared-error loss function and, finally,
by avoiding the problem of masking of the classes in multi-class
discrimination (Vapnik, 1998; Hastie et al., 2001).



2. RKHS - basic definitions


An RKHS is uniquely defined by a positive definite kernel function
$K(\mathbf{x}, \mathbf{y})$; i.e., a symmetric function of two
variables satisfying the Mercer theorem conditions (Schölkopf & Smola,
2002). Consider $K(.,.)$ to be defined on a compact domain
$\mathcal{X} \times \mathcal{X}$; $\mathcal{X} \subset \mathbb{R}^N$.
The fact that for any such positive definite kernel there exists a
unique RKHS is well established by the Moore-Aronszajn theorem. The
form $K(\mathbf{x}, \mathbf{y})$ has the following reproducing
property

$$f(\mathbf{y}) = \langle f(\mathbf{x}), K(\mathbf{x}, \mathbf{y}) \rangle_{\mathcal{H}} \quad \forall f \in \mathcal{H}$$

where $\langle ., . \rangle_{\mathcal{H}}$ is the scalar product in
$\mathcal{H}$. The function $K$ is called a reproducing kernel for
$\mathcal{H}$.

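It follows from Mercer’s theorem that each positive definite kernel
$K(\mathbf{x}, \mathbf{y})$ defined on a compact domain
$\mathcal{X} \times \mathcal{X}$ can be written in the form

$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{S} \lambda_i \phi_i(\mathbf{x}) \phi_i(\mathbf{y}), \quad S \le \infty \qquad (1)$$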

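where $\{\phi_i(.)\}_{i=1}^{S}$ are the eigenfunctions of the integral
operator $T_K : L_2(\mathcal{X}) \to L_2(\mathcal{X})$

$$(T_K f)(\mathbf{x}) = \int_{\mathcal{X}} K(\mathbf{x}, \mathbf{y}) f(\mathbf{y}) \, d\mathbf{y} \quad \forall f \in L_2(\mathcal{X})$$

and $\{\lambda_i > 0\}_{i=1}^{S}$ are the corresponding positive
eigenvalues. Rewriting (1) in the form

$$K(\mathbf{x}, \mathbf{y}) = \sum_{i=1}^{S} \sqrt{\lambda_i} \phi_i(\mathbf{x}) \sqrt{\lambda_i} \phi_i(\mathbf{y}) = \Phi(\mathbf{x})^T \Phi(\mathbf{y}) \qquad (2)$$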

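it becomes clear that any kernel $K(\mathbf{x}, \mathbf{y})$ also
corresponds to a canonical (Euclidean) dot product in a possibly
high-dimensional space $\mathcal{F}$ where the input data are mapped
by

$$\Phi : \mathcal{X} \to \mathcal{F}$$
$$\mathbf{x} \to (\sqrt{\lambda_1} \phi_1(\mathbf{x}), \sqrt{\lambda_2} \phi_2(\mathbf{x}), \ldots, \sqrt{\lambda_S} \phi_S(\mathbf{x}))$$

The space $\mathcal{F}$ is usually denoted as a feature space and
$\{\{\sqrt{\lambda_i} \phi_i(\mathbf{x})\}_{i=1}^{S}, \mathbf{x} \in \mathcal{X}\}$
as feature mappings. The number of basis functions $\phi_i(.)$ also
defines the dimensionality of $\mathcal{F}$.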
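The relation between a kernel and a dot product in a feature space can be illustrated on a finite sample. The following is only a sketch under stated assumptions: the Gaussian kernel parameterization, the helper name `gaussian_gram` and the random data are illustrative and not taken from the paper. It builds a Gram matrix and factors it as $\mathbf{\Phi}\mathbf{\Phi}^T$, a finite-sample analogue of (1) and (2).

```python
import numpy as np

def gaussian_gram(X, width=1.0):
    """Gram matrix K_ij = exp(-||x_i - x_j||^2 / width) for the rows of X.

    The parameterization by `width` is an assumption of this sketch.
    """
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-np.maximum(sq_dists, 0.0) / width)

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))          # 20 samples, N = 3 (made-up data)
K = gaussian_gram(X)

# Finite-sample analogue of (1) and (2): K = V diag(lam) V^T = Phi Phi^T
lam, V = np.linalg.eigh(K)
lam = np.clip(lam, 0.0, None)         # remove tiny negative round-off
Phi = V * np.sqrt(lam)                # row i plays the role of Phi(x_i)
```

Each row of `Phi` is an empirical feature vector whose pairwise dot products reproduce the kernel values on the sample.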

3. Partial Least Squares 

Because the PLS technique is not widely known, a description of
linear PLS is provided first; this will simplify the description of
its nonlinear kernel-based variant (Rosipal & Trejo, 2001).

3.1. Linear Partial Least Squares


Consider a general setting of the linear PLS algorithm to model the
relation between two data sets (blocks of observed variables)
$\mathcal{X}$ and $\mathcal{Y}$. Denote by
$\mathbf{x} \in \mathcal{X} \subset \mathbb{R}^N$ an $N$-dimensional
vector of variables in the first block of data and, similarly, by
$\mathbf{y} \in \mathcal{Y} \subset \mathbb{R}^M$ a vector of
variables from the second set. PLS models relations between these two
blocks by means of latent variables. Observing $n$ data samples from
each block of variables, PLS decomposes the $(n \times N)$ matrix of
zero-mean variables $\mathbf{X}$ and the $(n \times M)$ matrix of
zero-mean variables $\mathbf{Y}$ into the form

$$\mathbf{X} = \mathbf{T}\mathbf{P}^T + \mathbf{F}$$
$$\mathbf{Y} = \mathbf{U}\mathbf{Q}^T + \mathbf{G} \qquad (3)$$

where $\mathbf{T}$, $\mathbf{U}$ are $(n \times p)$ matrices of the
$p$ extracted score vectors (components, latent vectors), the
$(N \times p)$ matrix $\mathbf{P}$ and the $(M \times p)$ matrix
$\mathbf{Q}$ represent matrices of loadings and the $(n \times N)$
matrix $\mathbf{F}$ and the $(n \times M)$ matrix $\mathbf{G}$ are the
matrices of residuals. The PLS method, which in its classical form is
based on the nonlinear iterative partial least squares (NIPALS)
algorithm (Wold, 1975), finds weight vectors $\mathbf{w}$,
$\mathbf{c}$ such that

$$[cov(\mathbf{t}, \mathbf{u})]^2 = [cov(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 = \max_{|\mathbf{r}| = |\mathbf{s}| = 1} [cov(\mathbf{X}\mathbf{r}, \mathbf{Y}\mathbf{s})]^2$$

where $cov(\mathbf{t}, \mathbf{u}) = \mathbf{t}^T\mathbf{u}/n$
denotes the sample covariance between the two score vectors
(components). The NIPALS algorithm starts with random initialization
of the Y-score vector $\mathbf{u}$ and repeats a sequence of the
following steps until convergence:

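1) $\mathbf{w} = \mathbf{X}^T\mathbf{u} / (\mathbf{u}^T\mathbf{u})$
2) $\|\mathbf{w}\| \to 1$
3) $\mathbf{t} = \mathbf{X}\mathbf{w}$
4) $\mathbf{c} = \mathbf{Y}^T\mathbf{t} / (\mathbf{t}^T\mathbf{t})$
5) $\mathbf{u} = \mathbf{Y}\mathbf{c} / (\mathbf{c}^T\mathbf{c})$
6) repeat steps 1. - 5.

After the convergence, by regressing $\mathbf{X}$ on $\mathbf{t}$ and
$\mathbf{Y}$ on $\mathbf{u}$, the loading vectors
$\mathbf{p} = (\mathbf{t}^T\mathbf{t})^{-1}\mathbf{X}^T\mathbf{t}$ and
$\mathbf{q} = (\mathbf{u}^T\mathbf{u})^{-1}\mathbf{Y}^T\mathbf{u}$ can
be computed.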
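The six NIPALS steps and the loading computation translate almost line by line into NumPy. The sketch below is illustrative only: the function name `nipals_component`, the stopping rule on $\mathbf{u}$, the iteration cap and the synthetic data are assumptions of this example, not details given in the paper.

```python
import numpy as np

def nipals_component(X, Y, tol=1e-12, max_iter=500, seed=0):
    """Extract one PLS component from zero-mean X (n x N) and Y (n x M)."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=Y.shape[0])       # random Y-score initialization
    for _ in range(max_iter):
        w = X.T @ u / (u @ u)             # step 1
        w /= np.linalg.norm(w)            # step 2
        t = X @ w                         # step 3
        c = Y.T @ t / (t @ t)             # step 4
        u_new = Y @ c / (c @ c)           # step 5
        converged = np.linalg.norm(u_new - u) <= tol * np.linalg.norm(u_new)
        u = u_new
        if converged:                     # step 6: otherwise repeat
            break
    p = X.T @ t / (t @ t)                 # X-loadings
    q = Y.T @ u / (u @ u)                 # Y-loadings
    return w, c, t, u, p, q

# Made-up zero-mean data; the second Y column is scaled down so that the
# iteration has a clearly dominant direction and converges quickly.
rng = np.random.default_rng(0)
X = rng.normal(size=(30, 5)); X -= X.mean(axis=0)
Y = rng.normal(size=(30, 2)) * np.array([1.0, 0.1]); Y -= Y.mean(axis=0)
w, c, t, u, p, q = nipals_component(X, Y)
```

The loop is a power iteration, so its speed depends on the gap between the two largest eigenvalues of $\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{X}$.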

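However, it can be shown that the weight vector $\mathbf{w}$ also
corresponds to the first eigenvector of the following eigenvalue
problem (Höskuldsson, 1988)

$$\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{X}\mathbf{w} = \lambda\mathbf{w} \qquad (4)$$

The X-scores $\mathbf{t}$ are then given as

$$\mathbf{t} = \mathbf{X}\mathbf{w} \qquad (5)$$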

Similarly, eigenvalue problems for the extraction of $\mathbf{t}$,
$\mathbf{u}$ and $\mathbf{c}$ estimates can be derived (Höskuldsson,
1988). The nonlinear kernel PLS method is based on mapping the
original input data into a high-dimensional feature space
$\mathcal{F}$. In this case the vectors $\mathbf{w}$ and $\mathbf{c}$
usually cannot be computed. Thus, the NIPALS algorithm needs to be
reformulated into its kernel variant (Lewi, 1995; Rosipal & Trejo,
2001). Alternatively, the score vector $\mathbf{t}$ can be directly
estimated as the first eigenvector of the following eigenvalue
problem (Höskuldsson, 1988) (this can be easily shown by multiplying
both sides of (4) by the $\mathbf{X}$ matrix and using (5))

$$\mathbf{X}\mathbf{X}^T\mathbf{Y}\mathbf{Y}^T\mathbf{t} = \lambda\mathbf{t} \qquad (6)$$

The Y-scores $\mathbf{u}$ are then estimated as

$$\mathbf{u} = \mathbf{Y}\mathbf{Y}^T\mathbf{t} \qquad (7)$$
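The step from (4) and (5) to (6) can be checked numerically. The sketch below uses made-up random data with more variables than samples; all variable names are chosen for this illustration. It compares the score vector obtained through $\mathbf{w}$ with the one obtained directly from the $(n \times n)$ problem.

```python
import numpy as np

rng = np.random.default_rng(1)
n, N, M = 12, 30, 2                       # more variables than samples
X = rng.normal(size=(n, N)); X -= X.mean(axis=0)
Y = rng.normal(size=(n, M)); Y -= Y.mean(axis=0)

# (4): w is the leading eigenvector of the N x N matrix X^T Y Y^T X
A = X.T @ Y @ Y.T @ X
lam_w, W = np.linalg.eigh(A)              # eigenvalues in ascending order
w = W[:, -1]
t_from_w = X @ w                          # (5)

# (6): t is the leading eigenvector of the n x n matrix X X^T Y Y^T
B = X @ X.T @ Y @ Y.T
lam_t, T = np.linalg.eig(B)               # B is not symmetric in general
k = np.argmax(lam_t.real)
t = T[:, k].real
u = Y @ Y.T @ t                           # (7)
```

Both routes give the same leading eigenvalue and the same score direction up to sign and scale, while (6) only needs the $(n \times n)$ matrix $\mathbf{X}\mathbf{X}^T$, which is what makes the kernel formulation possible.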

3.2. Nonlinear Kernel Partial Least Squares 

Now, consider a nonlinear transformation of $\mathbf{x}$ into a
feature space $\mathcal{F}$. Using the straightforward connection
between an RKHS and $\mathcal{F}$, Rosipal and Trejo (2001) have
extended the linear PLS model into its nonlinear kernel form.
Effectively this extension represents the construction of a linear
PLS model in $\mathcal{F}$. Denote by $\mathbf{\Phi}$ the
$(n \times S)$ matrix of $\mathcal{X}$-space data mapped by
$\Phi(\mathbf{x})$ into an $S$-dimensional feature space
$\mathcal{F}$. Instead of an explicit mapping of the data, property
(2) can be used, resulting in

$$\mathbf{\Phi}\mathbf{\Phi}^T = \mathbf{K}$$

where $\mathbf{K}$ represents the $(n \times n)$ kernel Gram matrix of
the cross dot products between all input data points
$\{\Phi(\mathbf{x}_i)\}_{i=1}^{n}$, that is,
$K_{ij} = K(\mathbf{x}_i, \mathbf{x}_j)$ where $K(.,.)$ is a selected
kernel function. Similarly, consider a mapping of the second set of
variables $\mathbf{y}$ into a feature space $\mathcal{F}_1$ and denote
by $\mathbf{\Psi}$ the $(n \times S_1)$ matrix of $\mathcal{Y}$-space
data mapped by $\Psi(\mathbf{y})$ into an $S_1$-dimensional feature
space $\mathcal{F}_1$. Analogous to $\mathbf{K}$, define the
$(n \times n)$ kernel Gram matrix $\mathbf{K}_1$

$$\mathbf{\Psi}\mathbf{\Psi}^T = \mathbf{K}_1$$

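given by the kernel function $K_1(.,.)$. Using this notation the
estimates of $\mathbf{t}$ (6) and $\mathbf{u}$ (7) can be
reformulated into their nonlinear kernel variant

$$\mathbf{K}\mathbf{K}_1\mathbf{t} = \lambda\mathbf{t}$$
$$\mathbf{u} = \mathbf{K}_1\mathbf{t} \qquad (8)$$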

At the beginning of this section a zero-mean regression model was
assumed. To centralize the mapped data in a feature space
$\mathcal{F}$ the following procedure must be applied (Schölkopf et
al., 1998; Rosipal & Trejo, 2001)

$$\mathbf{K} \leftarrow (\mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T) \mathbf{K} (\mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T) \qquad (9)$$

where $\mathbf{I}_n$ is an $n$-dimensional identity matrix and
$\mathbf{1}_n$ represents the $(n \times 1)$ vector with elements
equal to one. The same is true for $\mathbf{K}_1$.
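Centralization (9) can be sanity-checked with a linear kernel, where the feature map is the identity and explicit centering is available for comparison. This is a sketch only; the helper name `center_gram` and the random data are assumptions of the example.

```python
import numpy as np

def center_gram(K):
    """Apply (9): K <- (I - 11^T/n) K (I - 11^T/n)."""
    n = K.shape[0]
    C = np.eye(n) - np.ones((n, n)) / n
    return C @ K @ C

rng = np.random.default_rng(2)
X = rng.normal(loc=3.0, size=(15, 4))     # made-up data with non-zero mean
K_centered = center_gram(X @ X.T)         # linear kernel, so Phi(x) = x
Xc = X - X.mean(axis=0)                   # explicit centering in feature space
```

For a nonlinear kernel the explicit route is not available, which is why the centering is carried out on the Gram matrix itself.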


After the extraction of new score vectors $\mathbf{t}$, $\mathbf{u}$
the matrices $\mathbf{K}$ and $\mathbf{K}_1$ are deflated by
subtracting their rank-one approximations based on $\mathbf{t}$ and
$\mathbf{u}$. The different forms of deflation correspond to
different forms of PLS (see Wegelin (2000) for a review). Because (4)
corresponds to the singular value decomposition of the transposed
cross-product matrix $\mathbf{X}^T\mathbf{Y}$, computation of all
eigenvectors from (4) at once involves implicit rank-one deflation of
the overall transposed cross-product matrix. Although the weight
vectors $\{\mathbf{w}_i\}_{i=1}^{\min(N,M)}$ will be mutually
orthogonal, the corresponding score vectors
$\{\mathbf{t}_i\}_{i=1}^{\min(N,M)}$, in general, will not be mutually
orthogonal. The same is true for the weight vectors
$\{\mathbf{c}_i\}_{i=1}^{\min(N,M)}$ and the score vectors
$\{\mathbf{u}_i\}_{i=1}^{\min(N,M)}$. This form of PLS was used by
Sampson et al. (1989) and in accordance with Wegelin (2000) it is
denoted as PLS-SB. The kernel analog of PLS-SB results from the
computation of all eigenvectors of (8) at once. PLS1 (one of the
blocks has a single variable) and PLS2 (both blocks are
multidimensional), generally used as regression methods, use a
different form of deflation. The deflation in the case of PLS1 and
PLS2 is based on rank-one reduction of the $\mathbf{\Phi}$ and
$\mathbf{\Psi}$ matrices using a new extracted score vector
$\mathbf{t}$ at each step. It can be written in the kernel form for
the $\mathbf{K}$ matrix as follows (Rosipal & Trejo, 2001)

$$\mathbf{K} \leftarrow (\mathbf{I}_n - \mathbf{t}\mathbf{t}^T) \mathbf{K} (\mathbf{I}_n - \mathbf{t}\mathbf{t}^T)$$

and in the same way for $\mathbf{K}_1$. This deflation is based on the
fact that the $\mathbf{\Phi}$ matrix is deflated as
$\mathbf{\Phi} \leftarrow \mathbf{\Phi} - \mathbf{t}\mathbf{p}^T = \mathbf{\Phi} - \mathbf{t}\mathbf{t}^T\mathbf{\Phi}$,
where $\mathbf{p}$ is the vector of loadings corresponding to the
extracted score vector $\mathbf{t}$. Similarly, for the
$\mathbf{\Psi}$ matrix the deflation has the form
$\mathbf{\Psi} \leftarrow \mathbf{\Psi} - \mathbf{t}\mathbf{c}^T = \mathbf{\Psi} - \mathbf{t}\mathbf{t}^T\mathbf{\Psi}$.
In the case of PLS1 and PLS2 the score vectors
$\{\mathbf{t}_i\}_{i=1}^{p}$ are mutually orthogonal. In general, this
is not true for $\{\mathbf{u}_i\}_{i=1}^{p}$ (Höskuldsson, 1988).
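The equivalence between deflating the mapped data and deflating the Gram matrix can be verified on an explicit finite-dimensional feature matrix. The sketch assumes a unit-norm $\mathbf{t}$, which the deflation formula requires; here $\mathbf{t}$ is an arbitrary random unit vector rather than an actual PLS score vector, and the data are made up.

```python
import numpy as np

rng = np.random.default_rng(3)
n, S = 10, 6
Phi = rng.normal(size=(n, S)); Phi -= Phi.mean(axis=0)
K = Phi @ Phi.T

t = rng.normal(size=n)
t /= np.linalg.norm(t)                    # the formula assumes ||t|| = 1

# Feature-space deflation: Phi <- Phi - t t^T Phi
Phi_defl = Phi - np.outer(t, t @ Phi)

# Kernel-space deflation: K <- (I - t t^T) K (I - t t^T)
P = np.eye(n) - np.outer(t, t)
K_defl = P @ K @ P
```

After deflation the direction $\mathbf{t}$ is removed from the Gram matrix, so later score vectors are extracted from the remaining variation only.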

4. Fisher’s LDA, CCA and PLS 

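Consider a set of $N$-dimensional samples
$\{\mathbf{x}_i \in \mathcal{X} \subset \mathbb{R}^N\}_{i=1}^{n}$
representing the data from $g$ classes (groups). Denote by
$\mathbf{0}_k$ a $(k \times 1)$ vector of all zeros and by
$\mathbf{1}_k$ a $(k \times 1)$ vector of all ones. Now, define the
$(n \times g-1)$ class membership matrix $\mathbf{Y}$ to be

$$\mathbf{Y} = \begin{pmatrix}
\mathbf{1}_{n_1} & \mathbf{0}_{n_1} & \cdots & \mathbf{0}_{n_1} \\
\mathbf{0}_{n_2} & \mathbf{1}_{n_2} & \cdots & \mathbf{0}_{n_2} \\
\vdots & \vdots & \ddots & \vdots \\
\mathbf{0}_{n_{g-1}} & \mathbf{0}_{n_{g-1}} & \cdots & \mathbf{1}_{n_{g-1}} \\
\mathbf{0}_{n_g} & \mathbf{0}_{n_g} & \cdots & \mathbf{0}_{n_g}
\end{pmatrix}$$

where $\{n_i\}_{i=1}^{g}$ denotes the number of samples in each class
and $\sum_{i=1}^{g} n_i = n$. In the same way as in Barker and Rayens
(2003), let
$\mathbf{S}_x = \frac{1}{n-1}\mathbf{X}^T\mathbf{P}_c\mathbf{X}$,
$\mathbf{S}_y = \frac{1}{n-1}\mathbf{Y}^T\mathbf{P}_c\mathbf{Y}$ and
$\mathbf{S}_{xy} = \frac{1}{n-1}\mathbf{X}^T\mathbf{P}_c\mathbf{Y}$ be
the sample estimates of the $\mathcal{X} \subset \mathbb{R}^N$ and
$\mathcal{Y} \subset \mathbb{R}^{g-1}$ space covariance matrices
$\mathbf{\Sigma}_x$ and $\mathbf{\Sigma}_y$, respectively, and of the
cross-product covariance matrix $\mathbf{\Sigma}_{xy}$. The matrix
$\mathbf{P}_c = (\mathbf{I}_n - \frac{1}{n}\mathbf{1}_n\mathbf{1}_n^T)$
is used to centralize the data. Further, let

$$\mathbf{H} = \sum_{i=1}^{g} n_i (\bar{\mathbf{x}}_i - \bar{\mathbf{x}})(\bar{\mathbf{x}}_i - \bar{\mathbf{x}})^T$$

and

$$\mathbf{E} = \sum_{i=1}^{g} \sum_{j=1}^{n_i} (\mathbf{x}_i^j - \bar{\mathbf{x}}_i)(\mathbf{x}_i^j - \bar{\mathbf{x}}_i)^T$$

represent the among-classes and within-classes sums-of-squares, where
$\bar{\mathbf{x}}_i = \frac{1}{n_i}\sum_{j=1}^{n_i}\mathbf{x}_i^j$,
$\bar{\mathbf{x}} = \frac{1}{n}\sum_{i=1}^{g}\sum_{j=1}^{n_i}\mathbf{x}_i^j$
and $\mathbf{x}_i^j$ represents an $N$-dimensional vector for the
$j$th sample in the $i$th class.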


CCA is a method which finds a pair of linear transformations of each
block of data with maximal correlation coefficient. This can be
formally described as the maximization problem

$$\max_{\mathbf{r}^T\mathbf{S}_x\mathbf{r} = \mathbf{s}^T\mathbf{S}_y\mathbf{s} = 1} [corr(\mathbf{X}\mathbf{r}, \mathbf{Y}\mathbf{s})]^2 = [corr(\mathbf{X}\mathbf{a}, \mathbf{Y}\mathbf{b})]^2 = [cov(\mathbf{X}\mathbf{a}, \mathbf{Y}\mathbf{b})]^2 / [var(\mathbf{X}\mathbf{a}) \, var(\mathbf{Y}\mathbf{b})]$$

where, similar to our previous notation, the symbols $corr$ and $var$
denote the sample correlation and variance, respectively. An estimate
of the weight vector $\mathbf{a}$ is given as the solution of the
following eigenvalue problem (Mardia et al., 1997)

$$\mathbf{S}_x^{-1}\mathbf{S}_{xy}\mathbf{S}_y^{-1}\mathbf{S}_{yx}\mathbf{a} = \lambda\mathbf{a}$$

where the eigenvalues $\lambda$ correspond to the squared canonical
correlation coefficients.

Without the assumption of Gaussian distribution of 
individual classes, Fisher developed a discrimination 
method based on a linear projection of the input data 
such that among-classes variance is maximized rela- 
tive to the within-classes variance. The directions onto 
which the input data are projected are given by the 
eigenvectors a of the eigenvalue problem 

$$\mathbf{E}^{-1}\mathbf{H}\mathbf{a} = \lambda\mathbf{a}$$


In the case of two-class discrimination with multi-normal
distributions with the same covariance matrices, Fisher’s LDA finds
the same discrimination direction as LDA using Bayes theorem to
estimate posterior class-probabilities, the method providing the
discrimination rule with minimal expected misclassification error
(Mardia et al., 1997; Hastie et al., 2001).

The connection between Fisher’s LDA directions and 
the directions given by CCA using a dummy matrix Y 
for group membership was first recognized by Bartlett. 
This connection expressed using the previously defined 
notation was formulated by Barker and Rayens (2003) 
(see also 11.5.4, Mardia et al., 1997) in the following 
two theorems: 


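Theorem 1

$$\mathbf{S}_{xy}\mathbf{S}_y^{-1}\mathbf{S}_{yx} = \frac{1}{n-1}\mathbf{H}$$

Theorem 2

$$\mathbf{S}_x^{-1}\mathbf{S}_{xy}\mathbf{S}_y^{-1}\mathbf{S}_{yx}\mathbf{a} = \lambda\mathbf{a} \iff \mathbf{E}^{-1}\mathbf{H}\mathbf{a} = \frac{\lambda}{1-\lambda}\mathbf{a}$$

The proof of the first theorem can be found in Barker and Rayens
(2003). Using the property of the generalized eigenvalue problem,
Theorem 1 and the fact that
$(n-1)\mathbf{S}_x = \mathbf{E} + \mathbf{H}$, the second theorem can
be proved.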
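Both theorems, together with the identity $(n-1)\mathbf{S}_x = \mathbf{E} + \mathbf{H}$, can be checked on synthetic data. The sketch below is illustrative only: the class sizes, the random three-class data and the variable names are assumptions of this example.

```python
import numpy as np

rng = np.random.default_rng(4)
sizes = [8, 10, 12]                        # n_i for g = 3 classes (made up)
g, n, N = len(sizes), sum(sizes), 4
labels = np.repeat(np.arange(g), sizes)
X = rng.normal(size=(n, N)) + 1.5 * rng.normal(size=(g, N))[labels]

# n x (g-1) class membership matrix: the last class is coded by all zeros
Y = np.zeros((n, g - 1))
for i in range(g - 1):
    Y[labels == i, i] = 1.0

Pc = np.eye(n) - np.ones((n, n)) / n       # centering matrix P_c
Sx = X.T @ Pc @ X / (n - 1)
Sy = Y.T @ Pc @ Y / (n - 1)
Sxy = X.T @ Pc @ Y / (n - 1)

# Among-classes (H) and within-classes (E) sums-of-squares
xbar = X.mean(axis=0)
H = np.zeros((N, N))
E = np.zeros((N, N))
for i in range(g):
    Xi = X[labels == i]
    d = Xi.mean(axis=0) - xbar
    H += len(Xi) * np.outer(d, d)
    R = Xi - Xi.mean(axis=0)
    E += R.T @ R

cca = Sxy @ np.linalg.solve(Sy, Sxy.T)     # S_xy S_y^{-1} S_yx
lam = np.sort(np.linalg.eigvals(np.linalg.solve(Sx, cca)).real)[::-1][: g - 1]
mu = np.sort(np.linalg.eigvals(np.linalg.solve(E, H)).real)[::-1][: g - 1]
```

Here `lam` holds the squared canonical correlations and `mu` the Fisher eigenvalues; Theorem 2 states that they are related by $\mu = \lambda / (1 - \lambda)$.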

A very close connection between Fisher’s LDA, CCA and PLS-SB methods
for multi-class discrimination has been shown in Barker and Rayens
(2003). This connection is based on the fact that PLS can be seen as
a form of penalized CCA

$$[cov(\mathbf{t}, \mathbf{u})]^2 = [cov(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 = var(\mathbf{X}\mathbf{w}) \, [corr(\mathbf{X}\mathbf{w}, \mathbf{Y}\mathbf{c})]^2 \, var(\mathbf{Y}\mathbf{c})$$

with penalties given by PCA in $\mathcal{X}$- and $\mathcal{Y}$-spaces.
Barker and Rayens (2003) suggested removing the $\mathcal{Y}$-space
penalty $var(\mathbf{Y}\mathbf{c})$, which is not meaningful in the
PLS-SB discrimination scenario. This modification in fact represents
a special case of the previously proposed orthonormalized PLS method
(Worsley, 1997) using the indicator matrix $\mathbf{Y}$. In this case
(4) is transformed into the eigenvalue problem

$$\mathbf{X}^T\mathbf{Y}(\mathbf{Y}^T\mathbf{Y})^{-1}\mathbf{Y}^T\mathbf{X}\mathbf{w} = \lambda\mathbf{w} \qquad (10)$$

Using Theorem 1 and the fact that
$(n-1)\mathbf{S}_{xy} = \mathbf{X}^T\mathbf{Y}$ and
$(n-1)\mathbf{S}_y = \mathbf{Y}^T\mathbf{Y}$ for zero-mean
$\mathbf{X}$ and $\mathbf{Y}$, the eigenvectors of (10) are
equivalent to the eigensolutions of

$$\mathbf{H}\mathbf{w} = \lambda\mathbf{w} \qquad (11)$$

Thus, this modified PLS method is based on eigensolutions of the
among-classes sum-of-squares matrix $\mathbf{H}$, which connects this
approach to CCA or equivalently to Fisher’s LDA (Barker & Rayens,
2003).
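The effect of removing the $\mathcal{Y}$-space penalty can be seen by forming the matrices of (4) and (10) side by side. The sketch below uses made-up three-class data and assumes column-centered $\mathbf{X}$ and $\mathbf{Y}$, in line with the zero-mean convention of Section 3.1; the names `M10` and `M4` are chosen for this illustration.

```python
import numpy as np

rng = np.random.default_rng(5)
labels = np.repeat(np.arange(3), [7, 9, 11])
n, N = labels.size, 5
X = rng.normal(size=(n, N)) + rng.normal(size=(3, N))[labels]
Y = np.eye(3)[labels][:, :2]               # indicator matrix, n x (g-1)

Xc = X - X.mean(axis=0)                    # zero-mean blocks
Yc = Y - Y.mean(axis=0)

# Left-hand matrix of (10): X^T Y (Y^T Y)^{-1} Y^T X
M10 = Xc.T @ Yc @ np.linalg.solve(Yc.T @ Yc, Yc.T @ Xc)

# Among-classes sum-of-squares matrix H
xbar = X.mean(axis=0)
H = np.zeros((N, N))
for i in range(3):
    d = X[labels == i].mean(axis=0) - xbar
    H += np.sum(labels == i) * np.outer(d, d)

# Left-hand matrix of (4), which keeps the Y-space penalty
M4 = Xc.T @ Yc @ Yc.T @ Xc
```

The matrix of (10) coincides with $\mathbf{H}$, whereas the plain PLS-SB matrix of (4) generally does not, so only the orthonormalized variant inherits the link to (11).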

5. Kernel PLS-SVC for Discrimination 

The connection between CCA, Fisher’s LDA and PLS motivates the use of
the orthonormalized PLS method for discrimination. The kernel
variant$^1$ of this approach will transform (8) into the following
equations

$$\mathbf{K}\mathbf{Y}(\mathbf{Y}^T\mathbf{Y})^{-1}\mathbf{Y}^T\mathbf{t} = \mathbf{K}\tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^T\mathbf{t} = \lambda\mathbf{t}$$
$$\mathbf{u} = \tilde{\mathbf{Y}}\tilde{\mathbf{Y}}^T\mathbf{t} \qquad (12)$$

where $\tilde{\mathbf{Y}} = \mathbf{Y}(\mathbf{Y}^T\mathbf{Y})^{-1/2}$
represents a matrix of uncorrelated and normalized original output
variables.

$^1$ Both linear and nonlinear PLS are considered. In the case of a
linear kernel the feature space $\mathcal{F}$ is equivalent to
$\mathcal{X}$ and in this case the kernel variant of PLS will be
preferable only if $N > n$.


Interestingly, in the case of two-class discrimination the direction
of the first kernel orthonormalized PLS score vector $\mathbf{t}$ is
identical with the first PLS score vector found by either the kernel
PLS1 or the kernel PLS-SB method. This immediately follows from the
fact that $\mathbf{Y}^T\mathbf{Y}$ is a number in this case. In this
two-class scenario $\mathbf{K}\mathbf{Y}\mathbf{Y}^T$ is a rank-one
matrix and kernel PLS-SB extracts only one PLS score vector
$\mathbf{t}$. In contrast, kernel PLS1 can extract additional input
space score vectors, up to the rank of $\mathbf{K}$, each possessing
the same similarity with directions computed with CCA and Fisher’s
LDA on deflated feature space matrices. This provides more principled
dimensionality reduction in comparison to standard PCA, which is
based on the criterion of maximum data variation in the
$\mathcal{F}$-space alone.
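The two-class statements can be illustrated numerically. In the sketch below the Gaussian kernel width, the random data and the class split are arbitrary assumptions; the output column is centered so that it matches the zero-mean convention used throughout.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 30
X = rng.normal(size=(n, 2))
y = np.where(np.arange(n) < 12, 1.0, 0.0)  # two classes, one indicator column

# Centered Gaussian Gram matrix (kernel width 2.0 is an arbitrary choice)
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
C = np.eye(n) - np.ones((n, n)) / n
K = C @ np.exp(-D / 2.0) @ C

yc = y - y.mean()
y_tilde = yc / np.sqrt(yc @ yc)            # Y (Y^T Y)^{-1/2} for one column

# (12): K y~ y~^T t = lambda t
M = K @ np.outer(y_tilde, y_tilde)
vals, vecs = np.linalg.eig(M)
k = np.argmax(vals.real)
t = vecs[:, k].real
t /= np.linalg.norm(t)

# Direction obtained without orthonormalization, t proportional to K y
t_direct = K @ yc
t_direct /= np.linalg.norm(t_direct)
```

Because $\mathbf{Y}^T\mathbf{Y}$ is a scalar here, the normalization only rescales the eigenvalue and leaves the score direction unchanged.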

In the case of multi-class discrimination the rank of the
$\tilde{\mathbf{Y}}$ matrix is equal to $g - 1$, which determines the
maximum number of score vectors that may be extracted by the kernel
orthonormalized PLS-SB method. Again, similar to the one-dimensional
output scenario, the deflation of the $\tilde{\mathbf{Y}}$ matrix at
each step can be done using the input space score vectors
$\mathbf{t}$; that is, in the defined discrimination scenario, using
the kernel orthonormalized PLS2 method. This different deflation used
in PLS2 in comparison to PLS-SB leads to the loss of the clear
connection to CCA and Fisher’s LDA as defined by the eigenproblem
(11). However, a sequence of PLS score vectors up to the rank of
$\mathbf{K}$ can be extracted and the similarity to (11) is
maintained in the sense of the modified among-classes sum-of-squares
matrix computed on deflated input and output spaces at each step.

Kernel variants of CCA and Fisher LDA have been proposed (Lai & Fyfe,
2000; Mika et al., 1999). Although the same relations among CCA,
Fisher LDA and PLS can be considered in a feature space $\mathcal{F}$
induced by the kernel function used, both kernel CCA and kernel
Fisher DA suffer from a singularity problem in the case of a higher
dimensionality $S > n$. Both algorithms at some point need to invert
singular matrices, which is avoided by using the regularization
concept of adding a small ridge (jitter) parameter on the diagonal of
those matrices. The connection between a regularized form of CCA, PLS
and orthonormalized PLS was developed in the context of canonical
ridge analysis by Vinod (1976).

On several classification problems the use of kernel PCA for
dimensionality reduction or de-noising followed by a linear SVC
computed on this reduced $\mathcal{F}$-space data representation has
shown good results in comparison to a nonlinear SVC using the
original data representation (Schölkopf & Smola, 2002; Schölkopf et
al., 1998). However, the results of the previous section suggest
replacing the kernel PCA data preprocessing step with the more
principled kernel orthonormalized PLS approach. In comparison to
kernel Fisher DA, this may become more suitable in the situation of
non-Gaussian class distributions in a feature space $\mathcal{F}$,
where more than $g - 1$ discrimination directions may better define
an overall discrimination rule. The use of a linear SVC as the
follow-up step is motivated by the construction of an optimal
separating hyperplane in the sense of maximizing the distance to the
closest point from either class (Vapnik, 1998; Schölkopf & Smola,
2002). Moreover, when the data are not separable the SVC approach
provides a way to control the extent of this overlap. Thus, the
kernel orthonormalized PLS is combined with the $\nu$-SVC or the
C-SVC (Schölkopf & Smola, 2002) classifier and this methodology is
denoted as kernel PLS-SVC. A short pseudocode of the method is
provided in the Appendix.

6. Experiments 

The usefulness of the kernel PLS-SVC method was tested on several
benchmark data sets of two-class classification and on a real-world
problem of discriminating finger movements versus periods of
non-movement based on electroencephalogram (EEG).

6.1. Benchmark Data Sets 

The data sets used in Rätsch et al. (2001) and Mika et al. (1999)
were chosen. The data sets are freely available and can be downloaded
from http://www.first.gmd.de/~raetsch. The data sets consist of 100
different training and testing partitions (except Splice and Image,
which consist of 20 partitions). In all cases the Gaussian kernel was
used. The unknown parameters (width of the Gaussian kernel, number of
PLS score vectors, $\nu$ and $C$ parameters for $\nu$-SVC and C-SVC,
respectively) were selected based on the minimum classification error
using five-fold cross validation (CV) on the first five training
sets.

The results are summarized in Table 1. Very good behavior of the
kernel PLS-SVC method can be observed. The null hypothesis of equal
means for the C-SVC and kernel PLS-SVC methods was tested using the
paired t-test (the individual test set classification errors for
kernel Fisher DA are not available). The nonparametric sign and
Wilcoxon matched-pairs signed-ranks tests were also used to test the
null hypotheses about the direction and size of the differences
within pairs. On six data sets (Banana, Breast Cancer, Diabetes,
Ringnorm, Twonorm, Waveform) the hypothesis was rejected with
p-values < 0.08. The one-sided alternative of both nonparametric
tests indicated lower classification errors of the kernel PLS-SVC
approach with p-values < 0.001 in all six cases. Although the paired
t-test did not reject the null hypothesis (p-value = 0.3) on the
Heart data set, the one-sided alternative of the nonparametric tests
indicates lower classification errors using C-SVC (p-values = 0.009).
The number of selected kernel PLS components determined by the CV
approach was lower than 10 except for the Image data set, where 27
score vectors were used. A significant improvement in terms of
averaged classification error over kernel Fisher DA can be seen in
this case. Interestingly, this superiority of kernel PLS-SVC over
kernel Fisher DA was also observed in cases where only one PLS score
vector was used (German, Ringnorm, Twonorm) but not on the Heart data
set.

Table 1. Comparison of the mean and standard deviation of test set
classification errors between kernel Fisher DA (KFD) (Mika et al.,
1999), C-SVC (Rätsch et al., 2001) and kernel PLS-SVC (asterisks
indicate data sets where C-SVC was used in contrast to $\nu$-SVC used
on the remaining data sets). The method with minimum averaged
classification error is highlighted in bold. The last row represents
the mean, over data sets, of the ratio between the averaged
classification error of a method and the averaged classification
error of the best method on a particular data set, minus one.


Data Set     KFD           C-SVC         KPLS-SVC
Banana       10.8±0.5      11.5±0.5      10.5±0.4
B. Cancer    25.8±4.6      26.0±4.7      25.1±4.5*
Diabetes     23.2±1.6      23.5±1.7      23.0±1.7
German       23.7±2.2      23.6±2.1      23.5±1.6
Heart        16.1±3.4      16.0±3.3      16.5±3.6
Image        4.76±0.58     2.96±0.60     3.03±0.61
Ringnorm     1.49±0.12     1.66±0.12     1.43±0.10
F. Solar     33.2±1.7      32.4±1.8      32.4±1.8
Splice       10.5±0.6      10.9±0.7      10.9±0.8
Thyroid      4.20±2.07     4.80±2.19     4.39±2.10
Titanic      23.2±2.06     22.4±1.0      22.4±1.1*
Twonorm      2.61±0.15     2.96±0.23     2.34±0.11
Waveform     9.86±0.44     9.88±0.43     9.58±0.36

mean %       7.2±16.4      6.1±8.2       1.1±1.7
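The summary statistic in the last row of Table 1 can be recomputed from the mean errors in the table. The sketch below is a reading of the caption, not the authors' code: for every data set each method's mean error is divided by the best mean error on that data set, one is subtracted, and the result is averaged over the 13 data sets and expressed in percent.

```python
import numpy as np

# Mean test errors (%) transcribed from Table 1: KFD, C-SVC, KPLS-SVC
errors = np.array([
    [10.8, 11.5, 10.5],   # Banana
    [25.8, 26.0, 25.1],   # B. Cancer
    [23.2, 23.5, 23.0],   # Diabetes
    [23.7, 23.6, 23.5],   # German
    [16.1, 16.0, 16.5],   # Heart
    [4.76, 2.96, 3.03],   # Image
    [1.49, 1.66, 1.43],   # Ringnorm
    [33.2, 32.4, 32.4],   # F. Solar
    [10.5, 10.9, 10.9],   # Splice
    [4.20, 4.80, 4.39],   # Thyroid
    [23.2, 22.4, 22.4],   # Titanic
    [2.61, 2.96, 2.34],   # Twonorm
    [9.86, 9.88, 9.58],   # Waveform
])

best = errors.min(axis=1, keepdims=True)   # best method per data set
excess = 100.0 * (errors / best - 1.0)     # percent above the best method
mean_excess = excess.mean(axis=0)          # one value per method
```

Under this reading the three column means round to the values reported in the last row, which also serves as a consistency check on the tabulated errors.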


6.2. Finger Movement Detection 

In the finger movement detection experiment of the Brain-Computer
Interface project (Trejo et al., 2003) the subject was instructed to
perform a self-paced single finger tap every five seconds. In four
runs the subject was instructed to alternate between the pinkie and
index fingers on a single hand; half of those runs were left hand
only and half were right hand only. In two runs the subject was then
instructed to alternate between both hands, keeping the same time
separation between taps. Each run contained approximately 50 single
taps. A 62-channel EEG and a 2-channel electrooculogram were recorded
using a Neuroscan 64-channel EEG cap with two 32-channel syn-amps
sampled at 1000 Hz. An electromyogram was also recorded using two
electrodes placed on each wrist. The raw EEG was cut into one-second
intervals spanning 300 ms before and 700 ms after the beginning of
the motion. The intervals were down-sampled from 1000 to 128 data
points using the Matlab routine resample. Both right and left hand
intervals were labeled as motion and classified against periods of
non-motion of equal length using the kernel PLS-SVC classifier and
the $\nu$-SVC classifier alone. The experiment with the same subject
was repeated two times with an interval of 56 days between the
sessions. These days are denoted day 1 and day 2, respectively. Due
to impedance problems with one of the electrodes (O1) during the
second day session only 61 channels of EEG were used. In total 225
periods of movement and 579 periods of non-movement were extracted
for day 1 and 288 movement periods versus 657 non-movement periods
for day 2. The dimensionality of each period was 7808 (61 electrodes
times 128 time points).

The classification of both finger movement and non-movement periods
on data measured during day 2 was based on the linear kernel PLS-SVC
model trained on day 1 data. The same was done using day 2 data to
predict day 1. The parameters for the kernel PLS-SVC models were
estimated using 10-fold CV on each day’s data separately. First, the
number of PLS score vectors was fixed and the $\nu$ parameter was
estimated. In Fig. 1 the dependence of the correct classification
rate on the number of selected PLS score vectors is depicted. The
asterisks indicate the correct classification rate when the final
number of PLS score vectors was determined using the CV approach. The
graphs indicate that a classification accuracy of about 90% can be
achieved. Using a range of $\nu$ values for $\nu$-SVC, the maximum
correct classification rate was 93.0% for the day 1 to day 2 scenario
and 90.7% for the day 2 to day 1 scenario. The results with the
nonlinear kernel PLS-SVC model using a Gaussian kernel indicate a
slight improvement in comparison to its linear variant.

In the proposed discrimination scenario the individual PLS score
vectors can be considered as different spatio-temporal processes with
respect to differentiation between movement and non-movement periods.
In the case of linear kernel PLS the corresponding weight vectors
$\mathbf{w}$ (eq. (4)) can be computed and their values reflect
important time points and spatial locations with respect to
discrimination. Using a head model implemented in Neuroscan software
these weight vectors were expressed as scalp topographical maps at
different time points (Fig. 2). Based on close visual inspection of
these maps 16 EEG channels were selected out of all 61 channels. The
selected electrodes were located predominantly over the right-hand
side sensori-motor area. Four electrodes from the occipital area (two
on each side) were also selected. The results using this reduced
number of electrodes are plotted in Fig. 1. In both cases the plots
indicate results comparable to those achieved with the full EEG
montage. However, in the second case, when the day 2 to day 1
prediction scenario was used, the reduced setting of the electrodes
has a tendency to overfit for a higher number of used components. To
further justify this electrode reduction, 100 different training and
testing partitions with a split ratio of 40:60% were created. This
was done for each day independently. Using 10-fold CV on the training
partitions, the reduced and full EEG montage linear kernel PLS-SVC
models were compared in terms of correct classification rates
achieved on the 100 test partitions. For day 1 the averaged correct
classification rate for the reduced set of electrodes was 89.5% ±1.7
in comparison to 88.2% ±1.4 using the full EEG montage. For day 2 the
results were 91.8% ±1.5 and 91.8% ±1.1, respectively. Using the
paired t-test, the null hypothesis of equal means was rejected for
day 1 (p-value < 0.001) but not for day 2. The one-sided alternative
of the nonparametric tests indicates superiority of the reduced
approach for day 1 (p-value < 0.001).



Figure 1. Comparison of the full 61 EEG electrodes setting
(dash-dotted line) with the reduced setting of 16 electrodes
(solid line). Asterisks show results achieved in the case
where 10-fold CV was used to select the number of score
vectors and the $\nu$ parameter for $\nu$-SVC. Top: the day 1 to
day 2 scenario. Bottom: the day 2 to day 1 scenario.



Figure 2. Topographic scalp projections of the first PLS
weight vector at the time point 370 ms after the onset of
movement. Left: day 1 data. Right: day 2 data.

7. Conclusions 

A new kernel PLS-SVC discrimination technique was proposed. Results
achieved on 13 benchmark data sets demonstrate the usefulness of the
proposed method and its competitiveness with other state-of-the-art
classification methods. On six benchmark data sets a statistically
significant superiority of kernel PLS-SVC over C-SVC was observed. In
contrast, this tendency was observed only in one case for C-SVC. In
terms of averaged classification error the superiority of kernel
PLS-SVC over kernel Fisher DA was observed in 10 out of 13 benchmark
data sets. In seven cases this was achieved using more than one score
vector, which raises the question of whether a single direction
extracted by kernel Fisher DA on these data sets is adequate to
discriminate two different classes.

In the case of finger movement detection from EEG, a linear kernel
PLS approach provided a way to reduce the number of used electrodes,
which is desirable in practice, without degradation of the
classification accuracy. A more detailed analysis of the individual
spatio-temporal processes defined by the extracted PLS score vectors
is the topic of a current study. This would provide a more principled
way for the selection of important spatial and temporal changes
during the finger motion. The topographical maps constructed using
the weight vectors of the constructed $\nu$-SVC models (one weight
vector for each model) have shown similarity to the plots using the
weight vectors corresponding to the first PLS score vectors. However,
these weight vectors represent a “global” discrimination of
spatio-temporal processes. Moreover, they are computed using the
support vectors only.

The theoretical connection between Fisher LDA, CCA and PLS was
described. This connection indicates that in the case of
dimensionality reduction with respect to discrimination in
$\mathcal{F}$ the kernel orthonormalized PLS method should be
preferred over kernel PCA.
Appendix


In the case of a one-dimensional output the SIMPLS algorithm provides
the same solution as PLS1. Thus, for two-class discrimination a
computationally more efficient canonical form of the SIMPLS algorithm
can be used (de Jong et al., 2001). This is based on the fact that in
this case $\mathbf{t} \propto \mathbf{K}\mathbf{Y}$. The kernel
PLS-SVC algorithm can then be defined in three major steps:

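1) kernel PLS components extraction

   compute $\mathbf{K}$ - centralized Gram matrix (eq. 9),
   set $\mathbf{K}_{res} = \mathbf{K}$, $p$ - the number of score vectors
   for $i = 1$ to $p$
      $\mathbf{t}_i = \mathbf{K}_{res}\mathbf{Y}$
      $\|\mathbf{t}_i\| \to 1$
      $\mathbf{u}_i = \mathbf{Y}(\mathbf{Y}^T\mathbf{t}_i)$
      $\mathbf{K}_{res} \leftarrow \mathbf{K}_{res} - \mathbf{t}_i(\mathbf{t}_i^T\mathbf{K}_{res})$
      $\mathbf{Y} \leftarrow \mathbf{Y} - \mathbf{t}_i(\mathbf{t}_i^T\mathbf{Y})$
   end
   $\mathbf{T} = [\mathbf{t}_1, \mathbf{t}_2, \ldots, \mathbf{t}_p]$; $\mathbf{U} = [\mathbf{u}_1, \mathbf{u}_2, \ldots, \mathbf{u}_p]$

2) projection of test samples (Rosipal & Trejo, 2001)

   compute $\mathbf{K}_t$ - centralized test set Gram matrix
   $\mathbf{T}_t = \mathbf{K}_t\mathbf{U}(\mathbf{T}^T\mathbf{K}\mathbf{U})^{-1}$

3) $\nu$-SVC or C-SVC built on score vectors $\mathbf{T}$, $\mathbf{T}_t$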
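Steps 1 and 2 of the pseudocode can be sketched in NumPy for the two-class case. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `kpls_scores` and `project`, the Gaussian kernel width and the synthetic two-class data are all made up for the example, the output column is centered inside the function, and step 3 (training an SVC on the score vectors) is left out.

```python
import numpy as np

def kpls_scores(K, y, p):
    """Step 1 for a centralized Gram matrix K and one output column y."""
    n = K.shape[0]
    K_res = K.copy()
    Y = (y - y.mean()).reshape(n, 1)            # zero-mean output column
    T = np.zeros((n, p))
    U = np.zeros((n, p))
    for i in range(p):
        t = (K_res @ Y)[:, 0]                   # t_i = K_res Y
        t /= np.linalg.norm(t)                  # ||t_i|| -> 1
        u = Y @ (Y.T @ t)                       # u_i = Y (Y^T t_i)
        K_res = K_res - np.outer(t, t @ K_res)  # deflate K_res
        Y = Y - np.outer(t, t @ Y)              # deflate Y
        T[:, i], U[:, i] = t, u
    return T, U

def project(K_t, K, T, U):
    """Step 2: T_t = K_t U (T^T K U)^{-1}."""
    return K_t @ U @ np.linalg.inv(T.T @ K @ U)

# Made-up two-class data and a centered Gaussian Gram matrix
rng = np.random.default_rng(7)
X = np.vstack([rng.normal(0.0, 1.0, size=(20, 2)),
               rng.normal(2.0, 1.0, size=(20, 2))])
y = np.repeat([0.0, 1.0], 20)
D = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=2)
C = np.eye(40) - np.ones((40, 40)) / 40
K = C @ np.exp(-D / 2.0) @ C

T, U = kpls_scores(K, y, p=3)
T_train = project(K, K, T, U)   # projecting the training set itself
```

Projecting the training Gram matrix should reproduce the extracted score vectors, which gives a quick consistency check before a test set Gram matrix is passed to `project`. The columns of `T` (and of the projected test scores) would then be the inputs to a linear $\nu$-SVC or C-SVC.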


In the case of multi-class discrimination ($g > 2$) the kernel
variant of the NIPALS algorithm (Rosipal & Trejo, 2001) with
uncorrelated outputs $\tilde{\mathbf{Y}}$ or the eigenproblem (12)
has to be solved to extract $\{\mathbf{t}_i\}_{i=1}^{p}$ and
$\{\mathbf{u}_i\}_{i=1}^{p}$.



